After reading this page, you will have an overall sense about the crimes around Columbia University campus (including CUIMC). We pictured the proportion of each level of crimes, and the charactristic of suspects and victims. We also focused on the total crime numbers across the years, showing a pattern of the fluctation of offense number with the months and times, so that we can have an idea about when it will be more dangerous and need more caution. At the same time, we also captured some interesting point of offense number.
filter_data = read_csv("../data/full_filter_data.csv")
## Rows: 291052 Columns: 20
## ── Column specification ──────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): boro, success_fail, level, description, location, susp_age, susp_race, susp_sex, vic_age...
## dbl (9): id, year, month, day, hour, minute, second, latitude, longitude
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
p <- c() # a list to save plots
p[[1]] <- ggplot(aes(x = level, fill = level), data = filter_data) +
geom_bar() +
labs(x = "Crime Level", y = "Number") + guides(fill = "none")
p[[2]] <- ggplot(aes(x = success_fail, fill = success_fail), data =
filter_data)+
geom_bar()+
labs(x = "Crime Completed or Not", y = "Number")+ guides(fill = "none")
p[[1]] + p[[2]]
The number of misdemeanor crime is the most (62.1%), which includes simple assault, petty theft, drug possession, indecent exposure, etc. While the number of felony is approximately half of misdemeanor (28.7%), which includes homicide offense, rubbery, burglary, sexual offense, drug crime, etc. The least type is violation (10%), including littering, drinking in public, walking unleashed dog, etc. Here exists a selection bias, most of people would not report a violation to the police. This figure does not indicate violation crimes is the least type in reality. At the meanwhile, nearly all (98%) of the crimes are completed – do not put your faith in luck that someone can help you, try to avoid them as much as you can! Next, we focus on the portrait of suspects and victims.
p[[3]] <- filter_data %>%
filter(susp_age %in% c("<18", "18-24", "25-44", "45-64", "65+",
"UNKNOWN")) %>%
mutate(susp_age = as.factor(susp_age)) %>%
count(susp_age, level) %>%
plot_ly(x = ~ susp_age, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>%
layout(title = "Suspects' Age Group", xaxis = list(title = ""), yaxis = list(title = "Number"))
p[[4]] <- filter_data %>%
filter (susp_race != "(null)") %>%
mutate(susp_age = as.factor(susp_race)) %>%
count(susp_race, level) %>%
plot_ly(x = ~ susp_race, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>%
layout(title = "Suspects' Race", xaxis = list(title = ""), yaxis = list(title = "Number"))
p[[5]] <- filter_data %>%
filter (susp_sex != "(null)") %>%
mutate(susp_sex = as.factor(susp_sex)) %>%
count(susp_sex, level) %>%
mutate(susp_sex = recode(susp_sex, U = "Unknown", F = "Female", "M" = "Male")) %>%
plot_ly(x = ~ susp_sex, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>%
layout(title = "Suspects' Sex", xaxis = list(title = ""), yaxis = list(title = "Number"))
p[[6]] <- filter_data %>%
filter(vic_age %in% c("<18", "18-24", "25-44", "45-64", "65+",
"UNKNOWN")) %>%
mutate(vic_age = as.factor(vic_age)) %>%
count(vic_age, level) %>%
plot_ly(x = ~ vic_age, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>%
layout(title = "Victims' Age Group", xaxis = list(title = ""), yaxis = list(title = "Number"))
p[[7]] <- filter_data %>%
filter (vic_race != "(null)") %>%
mutate(vic_age = as.factor(vic_race)) %>%
count(vic_race, level) %>%
plot_ly(x = ~ vic_race, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>%
layout(title = "Victims' Race", xaxis = list(title = ""), yaxis = list(title = "Number"))
p[[8]] <- filter_data %>%
filter (vic_sex %in% c("D", "E", "F", "M")) %>%
mutate(vic_sex = as.factor(vic_sex)) %>%
count(vic_sex, level) %>%
mutate(vic_sex = recode(vic_sex, F = "Female", "M" = "Male")) %>%
plot_ly(x = ~ vic_sex, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>%
layout(title = "Victims' Sex", xaxis = list(title = ""), yaxis = list(title = "Number"))
Through the data, we can see that except unknown, suspects’ age is centered at 25-44 years old, black man. As for the victim, the age group is still centered at 25-44 years old. But the race is more evenly spread, and female number is much more than the suspect group. Next, we want to look into the fluctuation of crime cases among months and hours, to see if the case number have a correlation with time.
data <- filter_data
month_data <- data %>%
group_by(year, month) %>%
summarise(number = n()) %>%
mutate(month = as.factor(month),
year = as.factor(year))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
month_data %>%
plot_ly(y = ~number, color = ~month, type = "box", colors = "viridis") %>%
layout(title = "Crime Number Across Months", xaxis = list(title = "Month"), yaxis = list(title = "Number"))
We can roughly see that, in summer there are more crimes, especially from May to August. December and February have fewer cases. We will verify if there is a significant difference between months in Statistical testing part.
hour_data = data %>%
group_by(year, month, hour) %>%
summarise(number = n()) %>%
mutate(hour = as.factor(hour),
hour = fct_inseq(hour))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
hour_data %>%
plot_ly(y = ~number, color = ~hour, type = "box", colors = "viridis") %>%
layout(title = "Crime Number Across Hours", xaxis = list(title = "Hour"), yaxis = list(title = "Number"))
Obvious pattern. It’s counter-intuitive that the time with most cases is not in the midnight, but in the afternoon.